{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each machine learning algorithm has strengths and weaknesses. A weakness of decision trees is that they are prone to overfitting on the training set. A way to mitigate this problem is to constrain how large a tree can grow. Bagged trees try to overcome this weakness by using bootstrapped data to grow multiple deep decision trees. The idea is that many trees protect each other from individual weaknesses.\n",
    "![images](../images/baggedTrees.png)\n",
    "\n",
    "In this video, I'll share with you how you can build a bagged tree model for regression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Bagged Trees Regressor\n",
    "from sklearn.ensemble import BaggingRegressor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Load the Dataset\n",
    "This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. The code below loads the dataset. The goal of this dataset is to predict price based on features like number of bedrooms and bathrooms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv')\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This notebook only selects a couple features for simplicity\n",
    "# However, I encourage you to play with adding and substracting more features\n",
    "features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors']\n",
    "\n",
    "X = df.loc[:, features]\n",
    "\n",
    "y = df.loc[:, 'price'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting Data into Training and Test Sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note, another benefit of bagged trees like decision trees is that you don’t have to standardize your features unlike other algorithms like logistic regression and K-Nearest Neighbors. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagged Trees\n",
    "\n",
    "<b>Step 1:</b> Import the model you want to use\n",
    "\n",
    "In sklearn, all machine learning models are implemented as Python classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This was already imported earlier in the notebook so commenting out\n",
    "#from sklearn.ensemble import BaggingRegressor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Step 2:</b> Make an instance of the Model\n",
    "\n",
    "This is a place where we can tune the hyperparameters of a model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reg = BaggingRegressor(n_estimators=100, \n",
    "                       random_state = 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Step 3:</b> Training the model on the data, storing the information learned from the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Model is learning the relationship between X (features like number of bedrooms) and y (price)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reg.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Step 4:</b> Make Predictions\n",
    "\n",
    "Uses the information the model learned during the model training process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Returns a NumPy Array\n",
    "# Predict for One Observation\n",
    "reg.predict(X_test.iloc[0].values.reshape(1, -1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predict for Multiple Observations at Once"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reg.predict(X_test[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Measuring Model Performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unlike classification models where a common metric is accuracy, regression models use other metrics like R^2, the coefficient of determination to quantify your model's performance. The best possible score is 1.0. A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = reg.score(X_test, y_test)\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tuning n_estimators (Number of Decision Trees)\n",
    "\n",
    "A tuning parameter for bagged trees is **n_estimators**, which represents the number of trees that should be grown. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of values to try for n_estimators:\n",
    "estimator_range = [1] + list(range(10, 150, 20))\n",
    "\n",
    "scores = []\n",
    "\n",
    "for estimator in estimator_range:\n",
    "    reg = BaggingRegressor(n_estimators=estimator, random_state=0)\n",
    "    reg.fit(X_train, y_train)\n",
    "    scores.append(reg.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize = (10,7))\n",
    "plt.plot(estimator_range, scores);\n",
    "\n",
    "plt.xlabel('n_estimators', fontsize =20);\n",
    "plt.ylabel('Score', fontsize = 20);\n",
    "plt.tick_params(labelsize = 18)\n",
    "plt.grid()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the score stops improving after a certain number of estimators (decision trees). One way to get a better score would be to include more features in the features matrix. So that's it, I encourage you to try a building a bagged tree model "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
